
feat: Add Ministral-3-3B VLM recipe with INT4 quantization and eval benchmarks#352

Open

titaiwangms wants to merge 7 commits into main from ministral-3b-text-export

Conversation

titaiwangms commented Apr 8, 2026

Summary

Adds a complete Olive recipe for exporting Ministral-3-3B-Instruct-2512 (Pixtral) as a 3-model VLM pipeline for ONNX Runtime GenAI:

  • Text decoder — Olive/ModelBuilder (GQA attention, YaRN RoPE, INT4 quantization)
  • Vision encoder — Mobius declarative export (Pixtral, dynamic H×W, 2D RoPE)
  • Embedding — Mobius export (token + image fusion)
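For orientation, here is a hedged sketch of how the resulting 3-model pipeline might be driven with the onnxruntime-genai Python API. The function name, prompt template, and paths are illustrative, and the API calls follow the published onnxruntime-genai multimodal examples, which may differ across versions — the recipe's actual inference.py is authoritative:

```python
def run_vlm(model_dir: str, image_path: str, prompt: str, max_length: int = 512) -> str:
    """Sketch of multimodal generation with ONNX Runtime GenAI.

    Assumes the onnxruntime-genai package; API names follow its
    published Python VLM examples and may differ across versions.
    """
    import onnxruntime_genai as og

    model = og.Model(model_dir)                      # reads genai_config.json
    processor = model.create_multimodal_processor()  # image preprocessing per processor_config.json
    images = og.Images.open(image_path)
    inputs = processor(prompt, images=images)        # fuses text tokens with image embeddings

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.set_inputs(inputs)
    generator = og.Generator(model, params)

    stream = processor.create_stream()
    pieces = []
    while not generator.is_done():
        generator.generate_next_token()
        pieces.append(stream.decode(generator.get_next_tokens()[0]))
    return "".join(pieces)
```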

Configurations

| Component | CUDA | CPU |
| --- | --- | --- |
| Text decoder | INT4 (MatMulNBits) | INT4 (MatMulNBits) |
| Vision encoder | FP16 | INT4 (MatMulNBits via Olive) |
| Embedding | FP16 | FP32 |

Benchmark Results (AI2D)

| Configuration | Accuracy | Samples | Latency (s/sample) | Gap vs PyTorch |
| --- | --- | --- | --- | --- |
| PyTorch FP32 (CPU) | 72.00% | 100 | 21.66 | baseline |
| PyTorch FP16 (CUDA) | 73.00% | 200 | 0.20 | baseline |
| ONNX CUDA (INT4 text + FP16 vision) | 71.65% | 200 | 0.11 | −1.35 pp |
| ONNX CPU (INT4 text + FP32 vision) | 71.13% | 194 | 26.86 | −0.87 pp |
| ONNX CPU (INT4 text + INT4 vision) | 69.07% | 194 | 33.28 | −2.93 pp |

All ONNX configs fall within the expected INT4 precision gap (<5 pp). ONNX on CUDA achieves roughly a 2× speedup over PyTorch CUDA FP16.

Key Features

  • _strip_unused_initializers() — removes dead weights from Olive INT4 output, reducing vision model from 1.7 GB → 220 MB (~90% size reduction)
  • _fix_gather_block_quantized() — preserves RoPE position cache through INT4 quantization by converting GatherBlockQuantized back to fp32 Gather
  • eval.py — AI2D benchmark tool comparing ONNX vs PyTorch baselines with per-sample logging
  • genai_config.json generation — auto-generates 3-model VLM runtime config with Pixtral image preprocessing

Dependencies

  • onnxruntime-genai PR #2076 — YaRN RoPE parity fixes (inv_freq, mscale, rope_theta fallback)
  • onnxruntime-genai PR #2077 — Mistral3 VLM support (C++ image processor, INT32 input_ids, context_length/max_length separation)
  • mobius PR #130 ("[Scanner]: Ran scanner and update README.md") — Mistral3 vision/embedding export support

Known Limitations

  • CPU INT4 vision: language drift — the INT4-quantized vision encoder occasionally produces embeddings that cause wrong-language responses (e.g., Chinese instead of English on challenge.jpg). FP16 vision (CUDA) does not exhibit this.
  • Single-image only — multi-image inputs not yet supported
  • FP8 checkpoint — default HF model uses FP8 weights; use -BF16 variant for PyTorch baselines

Copilot AI review requested due to automatic review settings April 8, 2026 23:15
titaiwangms marked this pull request as draft April 8, 2026 23:17

Copilot AI left a comment


Pull request overview

Adds a new “builtin” export + inference recipe for mistralai/Ministral-3-3B-Instruct-2512, targeting ONNX Runtime GenAI by exporting the text decoder via Olive/ModelBuilder and the vision/embedding pieces via Mobius, plus generating the runtime genai_config.json/processor_config.json.

Changes:

  • Introduces an end-to-end export/config-generation script (optimize.py) and a GenAI inference example (inference.py).
  • Adds Olive configs for CPU/mobile (INT4) and CUDA (FP16), along with recipe metadata (info.yml) and docs (README.md).
  • Adds custom patched modeling code under codes/ intended to support ONNX export.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py | Adds model config constants (currently with import-time HF loading). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt | Declares Olive + Mobius + torch/transformers dependencies. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md | Documents export workflow, output layout, and inference usage. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py | Implements export pipeline and GenAI config/tokenizer patching. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/info.yml | Registers builtin recipe metadata (keywords/EPs/devices/name). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py | Provides a CLI to run text-only and multimodal inference with ORT GenAI. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/text.json | Olive ModelBuilder config for FP16 CUDA decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/text.json | Olive ModelBuilder config for INT4 CPU/mobile decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/modeling_ministral3.py | Adds patched model components for ONNX-export-friendly behavior. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/__init__.py | Exposes Ministral3Model symbol. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/.gitignore | Ignores generated model artifacts and Olive cache. |


Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py
titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from 8058122 to 9d5c64b on April 9, 2026 21:31
titaiwangms changed the title from "Add Ministral-3-3B-Instruct-2512 recipe" to "Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export" on Apr 9, 2026
titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from b3f8592 to 5969770 on April 10, 2026 00:04
titaiwangms marked this pull request as ready for review April 10, 2026 21:58
titaiwangms force-pushed the ministral-3b-text-export branch from d3f7f6a to 5eb675d on April 14, 2026 20:23
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
titaiwangms force-pushed the ministral-3b-text-export branch 5 times, most recently from 9d5b928 to 7a914be on April 14, 2026 21:52
Complete Olive recipe for Ministral-3-3B-Instruct-2512 VLM using:
- Text decoder: Olive/ModelBuilder (INT4 for both CPU and CUDA)
- Vision encoder + embedding: Mobius (dynamo-free ONNX construction)
- Vision INT4 quantization: Olive post-export (CPU only)
- context_length=32768, Permute3D transform in processor_config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms force-pushed the ministral-3b-text-export branch from 7a914be to 1bdc231 on April 14, 2026 22:08
- Add _strip_unused_initializers to reduce INT4 model size (1.7GB→220MB)
- Add _fix_gather_block_quantized for RoPE cache preservation
- CUDA: INT4 text + FP16 vision (71.65% AI2D)
- CPU: INT4 text + INT4 vision (69.07% AI2D)
- Remove unnecessary genai_config overrides (trust ModelBuilder)
- Add comprehensive README with benchmark results
- Fix eval.py build_messages for Jinja sort compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
titaiwangms changed the title from "Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export" to "feat: Add Ministral-3-3B VLM recipe with INT4 quantization and eval benchmarks" on Apr 15, 2026
titaiwangms requested a review from Copilot April 15, 2026 22:22

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.



Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
titaiwangms and others added 5 commits April 15, 2026 22:39
- eval.py: Add explanatory comments to except-pass clauses
- optimize.py: Update docstring to match INT4 shipping config
- optimize.py: Document _get_hf_config MODEL_NAME usage
- optimize.py: Improve --dtype help text
- README.md: Fix precision labels (CUDA=INT4 text, CPU embedding=FP16)
- README.md: Remove stale FP32 embedding references

Note: eval.py dtype= kwarg is valid in transformers >=5.0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

When --models-dir differs from the default (<config-dir>/models/),
text.json output_dir is hardcoded so exports go to the default location.
Copy the entire export tree to --models-dir after export so that
update_genai_config() and fix_tokenizer() find the files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CUDA graph capture is unsupported for VLMs with dynamic image sizes.
Set enable_cuda_graph=0 for ALL models (decoder, vision, embedding),
matching the Qwen VLM recipe convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Olive caches the full resolved config including absolute output_dir.
On re-runs with different --models-dir, the stale cache writes to the
old path, creating unexpected directories (e.g., ministral3-cpu-int4-test).
Clear the cache before each quantization run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- export_text_decoder: Load text.json as dict, override output_dir
- export_vision_and_embedding: Already accepts output_dir parameter
- quantize_vision_and_embedding: Load vision.json as dict, override
  model_path and output_dir
- Remove shutil.copytree post-export step from main()
- Remove .olive-cache clear (no longer needed)
- Pass models_dir through export_models() pipeline

This eliminates duplicate directories, copy overhead for multi-GB files,
and ghost directories from stale Olive cache paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
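The load-as-dict + override approach described in this commit can be sketched as follows. The `output_dir` key, the `text` subdirectory, and the helper name are illustrative assumptions, not the recipe's actual code — consult the recipe's text.json for the real Olive config schema:

```python
# Illustrative sketch of overriding an Olive config's output location
# in memory, instead of copying multi-GB export trees after the fact.
import json
import os
import tempfile
from pathlib import Path


def load_config_with_output_dir(config_path: str, models_dir: str) -> dict:
    """Load an Olive JSON config and redirect its output to models_dir.

    The "output_dir" key is an assumption; check the actual schema.
    """
    config = json.loads(Path(config_path).read_text())
    config["output_dir"] = str(Path(models_dir) / "text")
    return config


# Demo with a throwaway config file standing in for text.json.
tmp = tempfile.mkdtemp()
cfg_file = os.path.join(tmp, "text.json")
Path(cfg_file).write_text(json.dumps({"output_dir": "models/text"}))

config = load_config_with_output_dir(cfg_file, os.path.join(tmp, "exports"))
# With Olive installed, the overridden dict could then be run directly, e.g.:
#   from olive.workflows import run as olive_run
#   olive_run(config)
```

Because the override happens before Olive resolves the config, no stale absolute paths are cached and no post-export copy is needed.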